[GSoC] Blockwise Quantization Tool #265
Merged
This PR introduces a Python tool for block quantizing ONNX models. The quantized models adhere to the ONNX standard, verified using `onnx.checker.check_model(self.model, full_check=True)`. Additionally, these block-quantized models are compatible with the `QuantizeLinear` and `DequantizeLinear` layers in OpenCV, introduced in opencv/opencv#25644, allowing them to be executed within the OpenCV DNN engine.

The tool currently performs asymmetric weight-only quantization on convolutional layers, and the desired quantization block size can be specified. Convolutional weights are first flattened, $[C_{out}, C_{in}, K_w, K_h] \rightarrow [C_{out}, C_{in} \times K_w \times K_h]$, and quantization is then applied along axis 1.
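As a rough illustration of this scheme (a minimal sketch, not the tool's actual implementation; the uint8 range, the padding strategy, and the function name are assumptions), per-block asymmetric quantization of a convolutional weight tensor could look like this:

```python
import numpy as np

def blockwise_quantize_conv_weight(w: np.ndarray, block_size: int = 16):
    """Sketch of asymmetric (uint8) blockwise quantization along axis 1."""
    c_out = w.shape[0]
    # Flatten [C_out, C_in, K_h, K_w] -> [C_out, C_in * K_h * K_w].
    flat = w.reshape(c_out, -1).astype(np.float32)
    # Pad axis 1 so it splits evenly into blocks of `block_size`.
    pad = (-flat.shape[1]) % block_size
    blocks = np.pad(flat, ((0, 0), (0, pad))).reshape(c_out, -1, block_size)

    # One scale and zero point per block (asymmetric / affine quantization).
    w_min = blocks.min(axis=-1, keepdims=True)
    w_max = blocks.max(axis=-1, keepdims=True)
    scale = np.where(w_max > w_min, (w_max - w_min) / 255.0, 1.0)
    zero_point = np.clip(np.round(-w_min / scale), 0, 255)

    q = np.clip(np.round(blocks / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale.squeeze(-1), zero_point.squeeze(-1).astype(np.uint8)
```

At inference time, the corresponding `DequantizeLinear` node reconstructs each block as `(q - zero_point) * scale`.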
Future enhancements could extend the tool's capabilities, making it more customizable and general.
The tool also provides a quantization summary, reporting the overall quantization mean squared error as well as the initial and final model sizes.
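For illustration only, a minimal sketch of how such a summary could be assembled (the helper name and its arguments are hypothetical, not the tool's actual API):

```python
import os
import numpy as np

def print_quantization_summary(original_path, quantized_path, original_w, dequantized_w):
    """Hypothetical helper: report overall quantization MSE and model sizes."""
    # original_w / dequantized_w: lists of the float weights before and after
    # a quantize/dequantize round trip.
    orig = np.concatenate([w.ravel() for w in original_w])
    deq = np.concatenate([w.ravel() for w in dequantized_w])
    mse = float(np.mean((orig - deq) ** 2))
    print(f"Quantization MSE : {mse:.6f}")
    print(f"Model size       : {os.path.getsize(original_path) / 2**20:.2f} MiB "
          f"-> {os.path.getsize(quantized_path) / 2**20:.2f} MiB")
```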
Testing
When employing a block size of 16 and normalized input images, the mean squared quantization error was found to be on the order of $10^{-2}$ to $10^{-3}$.
Furthermore, a qualitative assessment was conducted by blockwise-quantizing several models from this repository, executing them, and comparing the results with those of the original and int8 models.
The findings indicate that, with an even block size, the block-quantized model maintains performance levels equivalent to those of the original model while achieving a reduction in model size.
Here is an example applied to the YuNet face detection model:
Loading the following GIF may take some time because of its size.
Onnxruntime and DNN comparison
The resulting models have been tested using both `onnxruntime` and OpenCV DNN, both of which produced identical outputs for the same input data. Since onnxruntime introduced support for blockwise quantization inference only recently, and this functionality is not yet included in the latest release, the only way to test it is to build onnxruntime from source.
Then install the resulting wheel with pip.
To test the resulting networks with OpenCV DNN, you need to build OpenCV with the changes from pull request opencv/opencv#25644.
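The comparison between the two backends can be sketched as follows (a minimal example under assumptions: the model path, input shape, and single-output model are placeholders, not taken from this PR):

```python
import cv2
import numpy as np
import onnxruntime as ort

# Hypothetical block-quantized model and input; adjust to the actual model.
model_path = "yunet_block_quantized.onnx"
inp = np.random.rand(1, 3, 160, 120).astype(np.float32)

# Run with onnxruntime (built from source for blockwise DequantizeLinear support).
sess = ort.InferenceSession(model_path)
ort_out = sess.run(None, {sess.get_inputs()[0].name: inp})[0]

# Run with OpenCV DNN (built with the changes from opencv/opencv#25644).
net = cv2.dnn.readNetFromONNX(model_path)
net.setInput(inp)
dnn_out = net.forward()

# The two backends are expected to produce (nearly) identical outputs.
print("max abs difference:", np.max(np.abs(ort_out - dnn_out)))
```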